Introduction
The Case of Jack the Ripper:
In 1888, an unknown killer spread fear through the streets of London after five women were murdered. The public knew the killer only as Jack the Ripper. The case remains one of the most famous unsolved mysteries of all time and has perplexed detectives and scholars alike for the past 130 years. The authorities of the time had unsophisticated techniques for collecting evidence and were never able to narrow the investigation to a single suspect. Little evidence survives that might finally identify the killer. Jack the Ripper often taunted the investigators of his (or possibly her) crimes through letters, and these letters still exist today. Using data mining techniques, we compare the famous Jack the Ripper letters with writings and other prose from known suspects to see whether a prolific killer is among them.
Primary Source Documents
- The three canonical Jack the Ripper letters
- Writings and quotes from six suspects:
  - Prince Albert Victor
  - Lewis Carroll
  - Joseph Barnett
  - Carl Feigenbaum
  - Mary Pearcey
  - Walter Sickert
We began our data collection by acquiring the texts of the original Jack the Ripper letters and saving each letter as an individual text file. Next, we researched prominent suspects in the Jack the Ripper case and acquired writings and quotes by them. Our data set includes writings, testimonies, and quotes from six suspects. All suspect primary source documents were likewise split into individual text files.
Data Preprocessing
The next step was to get our data into a usable format. This required three R packages: “tidytext”, “readtext”, and “tidyverse”. Using “readtext”, we read in and formatted the text files. Then, using “tidytext” and the “tidyverse”, we converted the text into word frequencies. Once the data frame was in a usable format, we applied a min/max transformation, which rescales each value to the range 0 to 1.
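The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the two sample documents, the column names, and the `min_max()` helper are invented for the example, an inline tibble stands in for the data frame that `readtext` would load from files, and scaling across the whole table (rather than per document) is a guess.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the data frame readtext() would return
# (one row per text file, with doc_id and text columns).
docs <- tibble(
  doc_id = c("letter_a.txt", "suspect_a.txt"),
  text   = c("ha ha catch me when you can ha ha",
             "catch the train when you can")
)

# Split the text into one lowercase word per row, then count how often
# each word occurs in each document.
word_freq <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word)

# Min/max transformation: rescale a numeric vector to the range [0, 1].
# Hypothetical helper; a constant vector is mapped to all zeros.
min_max <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}

# Assumption: counts are scaled across the whole table.
word_freq <- word_freq %>%
  mutate(n_scaled = min_max(n))

print(word_freq)
```

Here the word "ha" occurs four times in the sample letter and receives a scaled value of 1, while words that occur once receive 0.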
Exploratory Analysis
Jack the Ripper Word Cloud:

- Created using the wordcloud package.
- These are the highest-frequency words in the letters.
- Jack the Ripper often used the words “ha ha” to mock police.
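A word cloud of this kind can be drawn with a call along the following lines. The word counts here are hypothetical placeholders, not the frequencies measured from the letters, and the argument choices are assumptions for a tiny example.

```r
library(wordcloud)

# Hypothetical word counts; the real input would be the frequency table
# built from the three letters.
freq <- c(ha = 4, boss = 3, catch = 2, police = 2, knife = 1)

set.seed(1)  # word placement is random, so fix the seed for repeatability
wordcloud(
  words        = names(freq),
  freq         = freq,
  min.freq     = 1,      # keep every word in this tiny example
  random.order = FALSE   # place the most frequent words first, near the centre
)
```

Words are sized in proportion to their counts, so the most frequent word appears largest.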
K-means Clustering
- We chose K-means clustering because it is an algorithm often used for text mining due to its ability to handle unstructured data.
- K-means is also a useful algorithm for exploring data.
- For our K-means model, we used 10 clusters.
- Accuracy = 85.41%
- Looking at the visualization, none of the suspects cluster with Jack the Ripper.
- We will employ other models to see if these results are consistent.
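The clustering step can be sketched with base R's `kmeans()`. Whether the original analysis used this function is an assumption, and the matrix below is hypothetical toy data with made-up word columns. It uses two clusters because four rows cannot support the 10 clusters used above.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical scaled word frequencies: rows are documents, columns are words.
x <- rbind(
  ripper_1  = c(0.9, 0.8, 0.1, 0.0),
  ripper_2  = c(1.0, 0.7, 0.0, 0.1),
  suspect_1 = c(0.1, 0.0, 0.9, 1.0),
  suspect_2 = c(0.0, 0.2, 1.0, 0.8)
)
colnames(x) <- c("word_1", "word_2", "word_3", "word_4")

# Fit two clusters, keeping the best of 10 random starts.
fit <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate known authorship against cluster assignment to see
# whether any suspect shares a cluster with the Ripper documents.
author <- c("ripper", "ripper", "suspect", "suspect")
print(table(author, cluster = fit$cluster))
```

A suspect row landing in the same cluster as the Ripper rows would be the kind of overlap the visualization was checked for.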
Decision Tree
Prediction                Jack the Ripper
Carl Feigenbaum           0
Joe Barnett               0
Lewis Carroll             0
Mary Pearcey              0
Prince Albert             1569
Walter Richard Sickert    0


- We chose a decision tree because its classification rules are more straightforward to interpret than those of many other classification algorithms.
- Accuracy: 91.1%
- Predicts Prince Albert as Jack the Ripper 66.7% of the time.
- Predicts Lewis Carroll as Jack the Ripper the other 33.3% of the time.
- The model does not implicate any of the other four suspects.
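The general approach of training on suspect texts and then classifying the letters can be sketched with the `rpart` package. The package choice, the two generic suspects, the feature names, and every number below are assumptions for illustration, not the project's data.

```r
library(rpart)

# Hypothetical training data: one row per document from a known suspect,
# with scaled frequencies of two words as features.
train <- data.frame(
  author = factor(rep(c("suspect_a", "suspect_b"), each = 5)),
  word_1 = c(0.9, 0.8, 1.0, 0.7, 0.9, 0.1, 0.2, 0.0, 0.3, 0.1),
  word_2 = c(0.1, 0.0, 0.2, 0.1, 0.3, 0.8, 0.9, 1.0, 0.7, 0.9)
)

# rpart's default minsplit of 20 would refuse to split only 10 rows,
# so the control settings are relaxed for this toy example.
fit <- rpart(author ~ word_1 + word_2, data = train, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# Hypothetical feature rows for three unattributed letters.
letters_df <- data.frame(word_1 = c(0.85, 0.95, 0.15),
                         word_2 = c(0.10, 0.20, 0.90))
pred <- predict(fit, newdata = letters_df, type = "class")

# Share of letters attributed to each suspect.
print(prop.table(table(pred)))
```

In this toy case two of the three rows are assigned to `suspect_a` and one to `suspect_b`, so the shares are two thirds and one third.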
Support Vector Machine (SVM)
- We chose SVM because it can handle both linear and nonlinear classification problems.
- Accuracy = 40%
- SVM also predicts Lewis Carroll to be Jack the Ripper.
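A comparable sketch for the SVM step uses `svm()` from the `e1071` package, fitted once with a linear kernel and once with a nonlinear (radial) kernel. The document does not say which package or kernel was used, so both are assumptions, as are the toy data and feature names.

```r
library(e1071)

# Hypothetical training data: one row per document from a known suspect.
train <- data.frame(
  author = factor(rep(c("suspect_a", "suspect_b"), each = 5)),
  word_1 = c(0.9, 0.8, 1.0, 0.7, 0.9, 0.1, 0.2, 0.0, 0.3, 0.1),
  word_2 = c(0.1, 0.0, 0.2, 0.1, 0.3, 0.8, 0.9, 1.0, 0.7, 0.9)
)

# A factor response makes svm() fit a classification model.
fit_linear <- svm(author ~ word_1 + word_2, data = train, kernel = "linear")
fit_radial <- svm(author ~ word_1 + word_2, data = train, kernel = "radial")

# Hypothetical feature rows for three unattributed letters.
letters_df <- data.frame(word_1 = c(0.85, 0.95, 0.15),
                         word_2 = c(0.10, 0.20, 0.90))

pred_linear <- predict(fit_linear, newdata = letters_df)
pred_radial <- predict(fit_radial, newdata = letters_df)

# Compare the two kernels' attributions side by side.
print(data.frame(linear = pred_linear, radial = pred_radial))
```

On data this cleanly separated both kernels agree; on real word frequencies the kernel choice can change the predictions, which is one reason to check accuracy on held-out suspect texts first.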
Comparison Analysis
K-means analysis is quick, produces easy-to-read results, and did not show any suspect clustering with Jack the Ripper.
The decision tree has a longer runtime and higher accuracy, and implicates Prince Albert as Jack the Ripper 67% of the time and the writer Lewis Carroll 33% of the time.
SVM had the longest runtime, with 40% accuracy, and predicted Lewis Carroll to be Jack the Ripper 100% of the time.
Conclusions
Did we finally solve the mystery of Jack the Ripper?
K-means had 85.41% accuracy and did not find any suspect cluster that overlapped with Jack the Ripper.
The decision tree had 91% accuracy and leaned toward Prince Albert as the most likely match, predicting him to be Jack the Ripper 67% of the time. That still leaves room for reasonable doubt.
SVM had low accuracy (40%) but classified writer Lewis Carroll as Jack the Ripper 100% of the time.
With all models generating different results, it is difficult to say beyond a reasonable doubt who Jack the Ripper really was.
The jury is still out on this 130-year-old mystery.